Making Differential Privacy Easier to Use for Data Controllers and Data Analysts using a Privacy Risk Indicator and an Escrow-Based Platform
Differential privacy (DP) enables private data analysis but is hard to use in
practice. For data controllers who decide what output to release, choosing the
amount of noise to add to the output is a non-trivial task because of the
difficulty of interpreting the privacy parameter ε. For data analysts
who submit queries, it is hard to understand the impact of the noise introduced
by DP on their tasks.
To address these two challenges: 1) we define a privacy risk indicator that
indicates the impact of choosing ε on individuals' privacy and use
that to design an algorithm that chooses ε automatically; 2) we
introduce a utility signaling protocol that helps analysts interpret the impact
of DP on their downstream tasks. We implement the algorithm and the protocol
inside a new platform built on top of a data escrow, which allows the
controller to control the data flow and achieve trustworthiness while
maintaining high performance. We demonstrate our contributions through an
IRB-approved user study, extensive experimental evaluations, and comparison
with other DP platforms. All in all, our work contributes to making DP easier
to use by lowering adoption barriers.
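The tradeoff the abstract describes — smaller ε means stronger privacy but noisier output — can be illustrated with the textbook Laplace mechanism. This is a generic DP sketch, not the paper's risk-indicator algorithm; the function name is illustrative:

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float, rng=None) -> float:
    """Release a count under epsilon-DP via the Laplace mechanism.

    A count query has sensitivity 1, so adding noise drawn from
    Laplace(scale=1/epsilon) suffices: a smaller epsilon gives
    stronger privacy but a noisier released value.
    """
    rng = rng or np.random.default_rng()
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

# Smaller epsilon -> larger expected noise magnitude.
for eps in (0.1, 1.0, 10.0):
    noisy = laplace_count(1000, eps, np.random.default_rng(0))
    print(f"epsilon={eps}: noisy count = {noisy:.1f}")
```

Interpreting what a given ε means for a concrete individual is exactly the gap the paper's privacy risk indicator aims to close for data controllers.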
Solo: Data Discovery Using Natural Language Questions Via A Self-Supervised Approach
Most deployed data discovery systems, such as Google Datasets and open data
portals only support keyword search. Keyword search is geared towards general
audiences but limits the types of queries the systems can answer. We propose a
new system that lets users write natural language questions directly. A major
barrier to using such a learned data discovery system is that it needs
expensive-to-collect training data, which limits its utility. In this paper,
we introduce a self-supervised approach to assemble training datasets and train
learned discovery systems without human intervention. It requires addressing
several challenges, including the design of self-supervised strategies for data
discovery, table representation strategies to feed to the models, and relevance
models that work well with the synthetically generated questions. We combine
all the above contributions into a system, Solo, that solves the problem end to
end. The evaluation results demonstrate the new techniques outperform
state-of-the-art approaches on well-known benchmarks. All in all, the technique
is a stepping stone towards building learned discovery systems. The code is
open-sourced at https://github.com/TheDataStation/solo
Comment: To appear at SIGMOD 202
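The self-supervised idea — assembling (question, table) training pairs without human labels — can be sketched as follows. This is a hypothetical template-filling strategy for illustration only, not Solo's actual generation pipeline; all names are invented:

```python
import random

def synth_questions(table_name: str, columns: list, rows: list,
                    n: int = 3, seed: int = 0) -> list:
    """Assemble (question, table) training pairs with no human labels.

    A toy self-supervised strategy: sample cell values from the table
    and slot them into question templates, so the table itself serves
    as the positive (relevant) answer for each generated question.
    """
    rng = random.Random(seed)
    templates = [
        "which dataset contains {col} {val}?",
        "find tables about {col} such as {val}",
    ]
    pairs = []
    for _ in range(n):
        row = rng.choice(rows)
        col = rng.choice(columns)
        question = rng.choice(templates).format(col=col, val=row[col])
        pairs.append((question, table_name))
    return pairs

pairs = synth_questions("city_budgets", ["city", "year"],
                        [{"city": "Chicago", "year": 2021},
                         {"city": "Boston", "year": 2020}])
for q, t in pairs:
    print(q, "->", t)
```

A relevance model can then be trained on such pairs; as the abstract notes, making the model work well with synthetically generated questions is one of the core challenges.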
Stateful data-parallel processing
Democratisation of data means that more people than ever are involved in the data analysis process. This is beneficial—it brings domain-specific knowledge from broad fields—but data scientists do not have adequate tools to write algorithms and execute them at scale. The processing models of current data-parallel processing systems, designed for scalability and fault tolerance, are stateless. Stateless processing facilitates capturing parallelisation opportunities and hides the complexity of fault tolerance. However, data scientists want to write stateful programs—with explicit state that they can update, such as matrices in machine learning algorithms—and are used to imperative-style languages. These programs struggle to execute with high performance in stateless data-parallel systems.
Representing state explicitly makes data-parallel processing at scale challenging. To achieve scalability, state must be distributed and coordinated across machines. In the event of failures, state must be recovered to provide correct results. We introduce stateful data-parallel processing that addresses these challenges by: (i) representing state as a first-class citizen so that a system can manipulate it; (ii) introducing two distributed mutable state abstractions for scalability; and (iii) adopting an integrated approach to scale-out and fault tolerance that recovers large state—spanning the memory of multiple machines. To support imperative-style programs, a static analysis tool analyses Java programs that manipulate state and translates them to a representation that can execute on SEEP, an implementation of a stateful data-parallel processing model. SEEP is evaluated with stateful Big Data applications and shows comparable or better performance than state-of-the-art stateless systems.
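The "state as a first-class citizen" idea can be sketched in miniature: state that the runtime itself can partition for scale-out and snapshot for recovery. This is an illustrative toy in Python, not SEEP's actual (Java-based) API; all names are invented:

```python
class PartitionedState:
    """A toy distributed-mutable-state abstraction.

    State is a first-class object the runtime can manipulate: it is
    hash-partitioned by key so partitions could live on different
    machines, and it can be checkpointed per partition so that only
    a failed partition's snapshot needs restoring.
    """
    def __init__(self, num_partitions: int = 4):
        self.partitions = [dict() for _ in range(num_partitions)]

    def _part(self, key):
        return self.partitions[hash(key) % len(self.partitions)]

    def update(self, key, fn, default=0):
        # Imperative-style in-place update, routed to the owning partition.
        part = self._part(key)
        part[key] = fn(part.get(key, default))

    def get(self, key, default=0):
        return self._part(key).get(key, default)

    def checkpoint(self):
        # Snapshot each partition independently for failure recovery.
        return [dict(p) for p in self.partitions]

state = PartitionedState()
for word in ["a", "b", "a"]:
    state.update(word, lambda c: c + 1)
print(state.get("a"))  # 2
```

In a real system the partitions would sit on different workers and checkpoints would go to durable storage; the point here is only that explicit, system-visible state enables both scale-out and recovery.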
What Does it Take to be a Social Agent?
The aim of this paper is to present a philosophically inspired list of minimal requirements for social agency that may serve as a guideline for social robotics. Such a list does not aim at detailing the cognitive processes behind sociality but at providing an implementation-free characterization of the capacities and skills associated with sociality. We employ the notion of intentional stance as a methodological ground to study intentional agency and extend it into a social stance that takes into account social features of behavior. We discuss the basic requirements of sociality and different ways to understand them, and suggest some potential benefits of understanding them in an instrumentalist way in the context of social robotics.
METAM: Goal-Oriented Data Discovery
Data is a central component of machine learning and causal inference tasks.
The availability of large amounts of data from sources such as open data
repositories, data lakes and data marketplaces creates an opportunity to
augment data and boost those tasks' performance. However, augmentation
techniques rely on a user manually discovering and shortlisting useful
candidate augmentations. Existing solutions do not leverage the synergy between
discovery and augmentation, and thus under-exploit the available data.
In this paper, we introduce METAM, a novel goal-oriented framework that
queries the downstream task with a candidate dataset, forming a feedback loop
that automatically steers the discovery and augmentation process. To select
candidates efficiently, METAM leverages properties of i) the data, ii) the
utility function, and iii) the solution set size. We show METAM's theoretical
guarantees and demonstrate them empirically on a broad set of tasks. All in all, we
demonstrate the promise of goal-oriented data discovery to modern data science
applications.
Comment: ICDE 2023 paper
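The feedback loop the abstract describes — query the downstream task with a candidate dataset, observe utility, steer the next selection — can be sketched as a simple greedy loop. This is a minimal sketch for intuition, not METAM's actual algorithm (which exploits additional structure for efficiency); all names are invented:

```python
def goal_oriented_augment(base, candidates, train_and_score, budget=10):
    """Greedy goal-oriented augmentation loop (illustrative sketch).

    Repeatedly queries the downstream task (train_and_score) with each
    remaining candidate dataset and keeps the one that improves task
    utility the most, stopping when no candidate helps.
    """
    selected, best = [], train_and_score(base, [])
    for _ in range(budget):
        gains = [(train_and_score(base, selected + [c]) - best, c)
                 for c in candidates if c not in selected]
        if not gains:
            break
        gain, c = max(gains, key=lambda g: g[0])
        if gain <= 0:          # feedback loop: stop when nothing helps
            break
        selected.append(c)
        best += gain
    return selected, best

# Toy downstream task: utility = how many goal columns the augmented data covers.
goal = {"income", "age", "zip"}
def score(base, sel):
    covered = set(base)
    for s in sel:
        covered |= s
    return len(goal & covered)

chosen, utility = goal_oriented_augment({"income"}, [{"age"}, {"zip"}, {"color"}], score)
print(chosen, utility)  # keeps {"age"} and {"zip"}; {"color"} adds no utility
```

Each call to `train_and_score` here stands in for actually augmenting the data and re-evaluating the ML or causal-inference task, which is why selecting candidates efficiently matters.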
Niffler: A Reference Architecture and System Implementation for View Discovery over Pathless Table Collections by Example
Identifying a project-join view (PJ-view) over collections of tables is the
first step of many data management projects, e.g., assembling a dataset to feed
into a business intelligence tool, creating a training dataset to fit a machine
learning model, and more. When the table collections are large and lack join
information—such as when combining databases or on data lakes—query by
example (QBE) systems can help identify relevant data, but they are designed
under the assumption that join information is available in the schema, and do
not perform well on pathless table collections that do not have join path
information.
We present a reference architecture that explicitly divides the end-to-end
problem of discovering PJ-views over pathless table collections into a human
and a technical problem. We then present Niffler, a system built to address the
technical problem. We introduce algorithms for the main components of Niffler,
including a signal-generation component that reduces the number of candidate
views, which can be large due to errors and ambiguity in both the data and the
input queries. We evaluate Niffler on real datasets to demonstrate the
effectiveness of the new engine in discovering PJ-views over pathless table
collections.
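On a pathless table collection, one basic signal for proposing join paths is value overlap between columns. The sketch below is a toy stand-in for this kind of signal, not Niffler's actual algorithms; names and the Jaccard threshold are illustrative:

```python
def candidate_joins(tables: dict, threshold: float = 0.5):
    """Propose join candidates over a 'pathless' table collection.

    Two columns from different tables become a join candidate when
    the Jaccard similarity of their value sets meets the threshold,
    approximating the missing join-path information.
    """
    cols = [(t, c, set(vals)) for t, tbl in tables.items()
            for c, vals in tbl.items()]
    joins = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            t1, c1, v1 = cols[i]
            t2, c2, v2 = cols[j]
            if t1 == t2:
                continue  # only cross-table joins are interesting here
            jac = len(v1 & v2) / max(len(v1 | v2), 1)
            if jac >= threshold:
                joins.append((f"{t1}.{c1}", f"{t2}.{c2}", round(jac, 2)))
    return joins

tables = {
    "orders":    {"cust_id": [1, 2, 3], "total": [9, 5, 7]},
    "customers": {"id": [1, 2, 3, 4], "name": ["a", "b", "c", "d"]},
}
print(candidate_joins(tables))  # [('orders.cust_id', 'customers.id', 0.75)]
```

Real collections produce many spurious candidates from dirty or ambiguous data, which is why a signal-generation component that prunes candidate views matters.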